(0.3.1) Unify JRA55 onto the generic dataset backend, extend region support across all paths#169
(0.3.1) Unify JRA55 onto the generic dataset backend, extend region support across all paths#169simone-silvestri wants to merge 58 commits intomainfrom
Conversation
This commit reworks `src/DataWrangling/` along two related axes:
1. JRA55 is no longer a special-case data source. The bespoke
`JRA55NetCDFBackend` struct, `JRA55FieldTimeSeries` constructor, and
`InMemory()` backend support are removed. JRA55 now flows through the
same `FieldTimeSeries(::Metadata)` / `Field(::Metadatum)` API as
ECCO4 / EN4 / WOA / GLORYS, with `inpainting = nothing` selecting the
chunked-yearly NetCDF dispatch path. `JRA55NetCDFBackend(N)` is kept
as a thin function that returns a `DatasetBackend(N, nothing;
inpainting=nothing)` so existing call-sites still construct the right
backend.
2. A new `PrefetchingBackend{B<:DatasetBackend}` wraps any
`DatasetBackend` and hides the next sliding-window's I/O behind the
current window's compute via `Threads.@spawn`. Opt in via
`prefetch=true` on `FieldTimeSeries(::Metadata)`,
`DatasetRestoring(...)`, or `JRA55PrescribedAtmosphere(...)`.
New / removed
-------------
- New `src/DataWrangling/dataset_backend.jl`: the existing
`DatasetBackend{N,C,I,M}` extracted from
`metadata_field_time_series.jl`, plus its constructors / accessors
and the generic per-file `set!` used by ECCO4 / EN4 / WOA.
- New `src/DataWrangling/prefetching_backend.jl`: the
`PrefetchingBackend` wrapper, hot/cold-path `set!`, cyclical-wrap
scheduling, property forwarding so `fts.backend.start` etc. continue
to address the inner `DatasetBackend`.
- New `retrieve_data(::JRA55Metadatum)` (split per dataset type to
handle the no-leap calendar issue in multi-year files; see below) so
that the generic `Field(::Metadatum)` path produces a correct 2D
slice for any JRA55 metadatum.
- Removed `JRA55FieldTimeSeries`, the `JRA55NetCDFBackend` struct, and
the JRA55-specific `Adapt.adapt_structure`.
- Updated tests and examples to the new pattern.
Net diff: +487 / −403.
Closes #18.
----
Background — why this matters
-----------------------------
OMIP simulations on the ORCA grid surfaced periodic wall-time spikes
during time stepping:
```
[ Info: iteration: 133680, wall time: 1.780 seconds
[ Info: iteration: 133690, wall time: 23.906 seconds <-- spike
[ Info: iteration: 133700, wall time: 1.732 seconds
```
Two causes were identified on `ss/omip-prototype`:
1. `set!(fts::JRA55NetCDFFTSMultipleYears)` was opening every yearly
NetCDF file in the metadata (~60 for the full 1958–2019 atmosphere)
per reload, even files with no overlap with the current window.
This added up to ~660 NetCDF opens per reload across 11 atmospheric
variables.
2. The remaining ~15 s spike (down from ~24 s after fix 1) is the
actual cost of reading ~2 GB of compressed NetCDF across 11 files,
and cannot be reduced further without either shrinking the window
(more frequent spikes) or hiding the I/O behind compute.
On the 1° configuration, this manifests roughly as **~35 s per reload
on the cold path** and **~15 s on the hot path** (after staging files
to fast scratch and applying the per-window file filtering from
`ss/omip-prototype`). Across a year of simulation that is a few percent
of total wall time, dominated by I/O serialised against the time step.
The first fix (per-window filename filtering + per-file `ftsn_loc`) is
ported here as a plain bug fix. The second is what motivates the
`PrefetchingBackend`.
How prefetching works
---------------------
A `PrefetchingBackend` carries an inner `DatasetBackend` plus three
mutable fields: a `Task`, a buffer `FieldTimeSeries` (a clone of the
main FTS whose `data` array is the prefetch destination), and the
absolute `next_start` index that the buffer will hold once its task
completes.
When Oceananigans calls `set!(fts)` after advancing the window:
1. If the pending prefetch's `next_start` matches the requested
`start`, `wait(task)` (typically a no-op because the read finished
while the time step was running) then
`copyto!(parent(fts.data), parent(buffer_fts.data))` — a memory copy.
2. Otherwise — the cold path on first reload, on checkpointer restart,
or after an unexpected window jump — drain any stale task,
synchronously load via a one-off clone FTS, then copy.
3. Either way, schedule the next window's load:
`Threads.@Spawn set!(next_buffer_fts)` with the inner backend
re-pointed at `mod1(start + length, length(fts.times))`.
The clone-FTS approach (rather than swapping `fts.backend` to the inner
DatasetBackend in place) keeps the type of `fts.backend` stable across
reloads and lets the spawned `set!` dispatch through the existing
JRA55-specific methods without any special-casing.
JRA55 calendar caveat (multi-year)
----------------------------------
JRA55 NetCDF files use a `DateTimeNoLeap` (365-day) calendar internally,
while `all_dates(::MultiYearJRA55, name)` is a `Dates.DateTime` step
range that includes Feb 29 of leap years. `retrieve_data` is therefore
split: `retrieve_data(::RepeatYearJRA55Metadatum)` uses position-based
indexing (safe — repeat year 1990 is itself non-leap), while
`retrieve_data(::MultiYearJRA55Metadatum)` reads the file's time axis
and matches by `(Y, M, D, H, min)` components, sidestepping the
calendar mismatch entirely. A pre-existing analogous bug in
`set!(::JRA55NetCDFFTSMultipleYears)` (`file_times` and `fts.times`
diverge across leap years) is **not** addressed here — flag for a
follow-up issue.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR unifies JRA55 with the generic Metadata / FieldTimeSeries + DatasetBackend pipeline used by other datasets, and introduces an async PrefetchingBackend to overlap sliding-window I/O with compute.
Changes:
- Extract
DatasetBackendinto its own module and routeFieldTimeSeries(::Metadata)through it (including aprefetchoption). - Add
PrefetchingBackendthat preloads the next time window usingThreads.@spawn. - Refactor JRA55 to use the generic APIs (remove
JRA55FieldTimeSeriesexport; update tests/examples accordingly).
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/DataWrangling/dataset_backend.jl |
New standalone DatasetBackend definition + generic per-file set!. |
src/DataWrangling/prefetching_backend.jl |
New async backend wrapper that prefetches the next sliding window. |
src/DataWrangling/metadata_field_time_series.jl |
Routes FieldTimeSeries(::Metadata) through DatasetBackend; adds prefetch kwarg. |
src/DataWrangling/JRA55/JRA55_field_time_series.jl |
Removes bespoke backend struct; adds JRA55 retrieve_data and JRA55-specific set! on DatasetBackend. |
src/DataWrangling/JRA55/JRA55_metadata.jl |
Makes JRA55 metadata compatible with generic native-grid path; sets default_inpainting(::JRA55Metadata)=nothing. |
src/DataWrangling/JRA55/JRA55_prescribed_atmosphere.jl |
Rewrites atmosphere construction on top of generic FieldTimeSeries(::Metadata) and adds prefetch kwarg. |
src/DataWrangling/restoring.jl |
Threads prefetch=false through DatasetRestoring. |
src/DataWrangling/DataWrangling.jl |
Includes new backend modules. |
src/DataWrangling/JRA55/JRA55.jl |
Updates exports (drops JRA55FieldTimeSeries, adds JRA55NetCDFBackend). |
src/NumericalEarth.jl |
Removes JRA55FieldTimeSeries from exports. |
test/test_jra55.jl |
Updates tests to use generic FieldTimeSeries(Metadata(...)) and adds prefetching regression coverage. |
test/test_downloading.jl |
Updates JRA55 download test to use FieldTimeSeries(Metadata(...)). |
examples/inspect_JRA55_data.jl |
Updates example to the new JRA55 access pattern. |
Comments suppressed due to low confidence (1)
src/DataWrangling/prefetching_backend.jl:126
set!(::PrefetchingFTS)allocates a brand-newFieldTimeSeriesbuffer on every reload (next_fts = buffer_field_time_series(...)). For large windows this implies repeated large allocations (and GC pressure) each time the in-memory window advances, which can easily dominate runtime and memory usage.
A more scalable approach is to allocate the buffer FTS once (or use a small fixed ring of 1–2 buffers), then reuse it by updating its backend to the next window start before spawning the task, rather than constructing a fresh FTS each time.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Heh, nice idea! Just to explore designs, would an alternative implementation be to add a This would avoid the wrapper on backend and could allow us to make this a default without materialization perhaps. |
Nice idea. We could implement a PR in Oceananigans, FieldTimeSeries would need to carry the prefetch state that would be hooked in into |
…th/NumericalEarth.jl into ss/dataset-backend-prefetch
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
|
As a part of this PR I will temporarily disable the |
glwagner
left a comment
There was a problem hiding this comment.
A few complaints...
-
constructors like
blah_blah_ftsdon't read well in my opinion. We could useblah_blah_field_time_series. But much better would beBlahBlahFieldTimeSeries. I think this is much more obvious -
time_indices_in_memoryis a pretty awkward kwarg but maybe we are stuck with it
…th/NumericalEarth.jl into ss/dataset-backend-prefetch
|
@glwagner this is ready to merge |
Summary
Two threads, woven together:
JRA55 now uses the same code path as every other dataset. The bespoke
JRA55NetCDFBackendandJRA55FieldTimeSeriesare gone. JRA55 atmospheric variables flow throughField(::Metadatum)andFieldTimeSeries(::Metadata)like ECCO, EN4, WOA, GLORYS, ERA5, ORCA.BoundingBoxandColumnregions now work everywhere. These types existed onmain, but Column extraction was a special-cased branch inField(::Metadatum), regions weren't supported onFieldTimeSeriesat all, and JRA55 ignored them entirely. After this PR, regions work uniformly acrossField,FieldTimeSeries, every dataset including JRA55, on CPU and GPU.Closes #18.
What changes for users
Old JRA55 entry points are gone:
New, same shape as every other dataset:
Regions on any dataset, including JRA55:
The
InMemory()backend for JRA55 is dropped; if you want the whole series in memory, passtime_indices_in_memory = length(metadata).What was already on main, what we added
Already there:
BoundingBoxandColumnregion types (src/DataWrangling/metadata.jl).construct_native_griddispatches for both region types.Field(::Metadatum)accepting Column metadata via a separatecolumn_field_from_filebranch.ShiftSouth/AverageNorthSouth) for staggered files.What this PR adds:
_set_region_kernel!(src/DataWrangling/set_region_data.jl) that all read paths funnel through. It handles, in one pass:BoundingBoxoffset (straight indexed read into the clipped target),Columncyclic-aware bilinear/nearest blending,_FillValue/land cells don't poison the average,Field(::Metadatum)rewritten on top of this kernel — the specialcolumn_field_from_filebranch is gone; both regions go throughset_region_data!.FieldTimeSeries(::Metadata)and the JRA55set!paths also callset_region_data!, so regions work for time series and for JRA55 (neither did before).bracket_with_weightwith optionalperiod) so a Column at 359.5° wraps correctly across the periodic seam.Smaller cleanups along the way
FieldTimeSeries(::Metadata, grid)collapsed into a single architecture-aware comparison.DateTimeNoLeapworkaround was unnecessary; the bespoke component-matching helper is gone, replaced with a plainfindfirst. Function name and surrounding comments updated to match reality.test_column_field.jl,test_dataset_region.jl,test_mangling.jl,test_jra55_region.jl, plus additions totest_metadata.jl.Test plan
test/test_jra55.jl,test/test_jra55_region.jl— JRA55 on the new path, with and without regionstest/test_column_field.jl— Column extraction (Linear/Nearest, cyclic wrap, NaN-aware blend)test/test_dataset_region.jl— BoundingBox across datasetstest/test_mangling.jl— ECCOv_velocitymangling end-to-endtest/test_ocean_only_model.jl,test/test_ocean_sea_ice_model.jl— migrated to the new atmosphere signaturetest/test_checkpointer.jl— JRA55 atmosphere survives JLD2 round-trip on the new path